Training production parameters of context-dependent phones for speech recognition

نویسنده

  • Mats Blomberg
چکیده

A representation form of acoustic information in a trained phone library at the production parametric as well as the spectral level is described. The phones are trained in the parametric domain and are transformed to the spectral domain by means of a synthesis procedure. By this twofold description, potentially more powerful procedures for speaker adaptation and generation of unseen triphones can be explored, while the more robust spectral representation can be used for recognition. Context-dependent phones are represented by control parameters to a cascade formant synthesiser. During training, the parameters are extracted using an analysis-by-synthesis technique and the trajectories are approximated by piece-wise linear segments. For recognition, the parameter tracks are transformed to a sequence of spectral subphone states, similar to a Hidden Markov model. Recognition is performed by Viterbi search in a finitestate network. Recognition experiments have been performed on Swedish connected-digit strings pronounced by seven male speakers. In one experiment, unseen triphones were created by concatenating monophones and diphones and interpolating the parameter trajectories between line endpoints. In another, speaker adaptation was based on generalisation of dzflerences of observed triphones from the phone library. With optimum weighting of duration information, the results for cross-speaker recognition, speaker adaptation, and multi-speaker training were 98.5%, 98.9% and 99.1 % correct digit recognition, respectively. Preliminary experiments with created unseen triphones show no improvement. In informal listening tests of resynthesised digit strings from concatenation of trained triphones, the speech has been judged as intelligible, however, far from natural.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition

Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...

متن کامل

Allophone-based acoustic modeling for Persian phoneme recognition

Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...

متن کامل

Discriminative models for speech recognition

The discriminative approach to speech recognition offers several advantages over the generative, such as a simple introduction of additional dependencies and direct modelling of sentence posterior probabilities/decision boundaries. However, the number of sentences that can possibly be encoded into an observation sequence can be vast, which makes the application of models, such as support vector...

متن کامل

Minimum confusibility training of context dependent demiphones

During the last years two di erent approaches have been widely used in order to improve the acoustic modeling in continuous speech recognition systems: discriminative training algorithms and context dependent subword units. However, while the use of each of these techniques leads to much better results than standard maximum likelihood trained phone models, their combination, i.e. discriminative...

متن کامل

Context dependent hybrid HMM/ANN systems for large vocabulary continuous speech recognition system

In this paper, hybrid HMM/ANN systems are used to model context dependent phones. In order to reduce the number of parameters as well as to better catch the dynamics of the phonetic segments, we combine (context dependent) diphone models with context independent phone models. Transitions from phone to phone are modeled as generalized context dependent distributions while phonetic units are cont...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007